Abstract


A novel, human-infecting coronavirus causing COVID-19 was first identified in
Wuhan, China in December 2019. Within a short span of time the virus caused
more than 1 million deaths worldwide. This study addresses the overall
evolutionary process using complete SARS-CoV2 genomes. Given the
complexity of the task, network-based approaches were used to map samples
to their reported locations. A total of 473 complete human-coronavirus genomes
from 20 different countries were studied, including samples from 17 states of the
United States and samples from the Diamond Princess cruise ship. The global-scale
phylodynamic network was classified into five clusters, two of which consisted
mainly of samples from the USA. Cluster B was a shared cluster of samples from
China and the USA, while clusters A and C were diverse. Chinese
samples aggregated in clusters A and B, consistent with the retention of a
homogeneous viral genomic pool. In contrast, samples from the USA and Spain
were split into distinct clusters, which indicates multiple ports of entry and
possibly a delay in quarantine measures. Among the intra-USA samples,
we found that sequences reported from Washington and Virginia were scattered,
indicating evolutionary diversity. This report provides insight into the
transmission pattern of CoV2, which is difficult to evaluate exclusively
through conventional surveillance means.


Keywords: human-coronavirus, phylodynamic network, evolutionary diversity, 
genomic pool 


Introduction 


A novel, human-infecting coronavirus called SARS-CoV2, which causes COVID-19,
was first identified with the use of next-generation sequencing in Wuhan, China
in late December 2019 [1]. Contagion among medical workers and in family clusters
was also reported, confirming human-to-human transmission [2]. Patients with
COVID-19 exhibit high fever, sore throat, and dyspnea, with invasive lesions
present in both lungs as revealed by chest radiography [2, 3]. Within a period of 4
months the virus spread to more than 210 countries, becoming an international
emergency in which the European Region, Region of the Americas, Western Pacific
Region and Eastern Mediterranean Region were the worst affected. As of April 13,
2020, more than 1,773,084 confirmed cases had been reported around the world, with
111,640 fatalities (www.cdc.gov). As an RNA virus, SARS-CoV2 has a high
mutation rate, which in turn allows the underlying genealogy connecting sampled
viruses to be estimated [4]. SARS-CoV2 shares 96.3% genetic similarity
with the bat coronavirus RaTG13, which was obtained from bats in Yunnan in
2013 and is used as an outgroup in recent studies [5]. Identifying the origin and
transmission pattern of such a pathogen is imperative to block the means of
further spread [6].


Several approaches are being employed to combat the pandemic. Treatments
with antiviral drugs, chloroquine, corticosteroids, and convalescent plasma
transfusion are being tested with limited success [7-12]. Development of a
potential vaccine is a time-consuming process, and until then conventional public
health procedures, such as isolation, quarantine, community distancing and social
containment, can be used to stop the spread of this viral disease [13]. To test
these tactics successfully, phylogenetic methods can be employed in
clinical studies to investigate pathogen spread within an individual and within
communities. Moreover, understanding the global transmission and phylodynamic
pattern of COVID-19 can assist in tracking undocumented infection sources and
tracing the route of infection transmission. New cases are being reported every day,
and with them sequencing data is becoming readily accessible. In our study we
included sequence entries from 20 different countries, analyzed and mapped 473
complete CoV2 genomes, and connected them through network-based distances
retrieved from whole-genome sequencing.


Materials & Methods 


All the sequences used in this study were retrieved from the NIH NCBI Virus
database (http://www.ncbi.nlm.nih.gov/labs/virus). Entries with incomplete
genomes were removed, leaving a final dataset of 473 sequences:
355 from the USA, 71 from China, 18 from Spain, 4 from Korea, 3 from Taiwan,
2 sequences each from India, Pakistan, Vietnam, Nepal, Israel and Iran, and
1 sequence each from Australia, Finland, France, Peru, Brazil, Japan, Sweden and
Colombia. Prior to alignment with the MAFFT online server [14], non-nucleotide
characters were removed. The alignment, which included the bat coronavirus
sequence, was performed with a strategy using the 1PAM/k=2 substitution matrix
and a gap penalty of 1.53. The alignment file was manually adjusted by removing
the first 30-40 nucleotides at the 5' end and the poly-A sequences at the 3' end.
The aligned dataset was exported to MEGA to generate a time tree using the
neighbor-joining approach [15]. The DnaSP 6 package was used for data
format conversion [16]. PopART version 1.7 was used to convert
all the time trees into median-joining networks using an epsilon value of 0, and
the final networks were drawn with an iteration value of 5000 [17, 18]. For
graphical editing, paint.net was used. To reproduce these
results, the alignment file of all 473 genomes can be accessed from the
supplementary section.
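
The two manual preprocessing steps described above (removing non-nucleotide characters before alignment and trimming both ends of the alignment afterwards) can be sketched in Python as follows. This is only an illustrative sketch, not the procedure actually used in the study: the function names, the decision to keep the ambiguity code N, and the fixed column counts passed to the trimming function are all assumptions.

```python
import re

# Assumption: only A, C, G, T, U and the ambiguity code N count as nucleotides;
# everything else (digits, whitespace, stray symbols) is dropped.
NON_NUCLEOTIDE = re.compile(r"[^ACGTUN]")


def clean_sequence(raw: str) -> str:
    """Upper-case a raw sequence and strip non-nucleotide characters."""
    return NON_NUCLEOTIDE.sub("", raw.upper())


def trim_alignment(aligned: dict[str, str], head: int, tail: int) -> dict[str, str]:
    """Drop `head` columns at the 5' end and `tail` columns at the 3' end
    of every sequence in an alignment (all rows must have the same length)."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must have equal length")
    end = lengths.pop() - tail
    return {name: seq[head:end] for name, seq in aligned.items()}


# Toy usage with made-up sequences (hypothetical data, not from the study).
cleaned = clean_sequence("acg t\n12N-x")
trimmed = trim_alignment({"a": "AAACGTTT", "b": "AA-CGTTT"}, head=2, tail=3)
```

In practice the number of columns to trim would be chosen by inspecting the alignment, as the text describes a range of 30-40 nucleotides rather than a fixed value.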


Results 


To understand the spread and evolving dynamics of CoV2, all the genomes
available in the NCBI Virus database (www.ncbi.nlm.nih.gov/labs/virus) were
mapped. A total of 473 complete CoV2 genomes, comprising sequence entries from
20 different countries, were selected for analysis. Based on available reports, the
Bat-CoV genome was used as the outgroup [5]. Our analyses were
consistent with other reports showing that samples from Wuhan (MT291831)
and Shenzhen/Hong Kong (MN975262) are closest to the source. The former
sample branches out into two clusters, A and B. Three samples (MN997409,
Arizona; MT106054, Texas; and MN938384, Hong Kong/Shenzhen) connect it with
cluster B and one sample (MT304489, Texas) connects it with cluster A, sharing
one and four mutations each (Figure 1).
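
The mutation counts quoted for the branches of the network correspond to the number of aligned positions at which two genomes differ. A minimal Python sketch of such a count is given below. It assumes that positions carrying a gap or an ambiguous base in either sequence are skipped, which may differ from how the network software treats them, and the function name is hypothetical.

```python
def count_mutations(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where two sequences carry different bases.

    Assumption of this sketch: positions with a gap ('-') or an ambiguous
    base ('N') in either sequence are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    skipped = {"-", "N"}
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in skipped and b not in skipped
    )


# Toy usage with made-up sequences (hypothetical data, not from the study).
example_count = count_mutations("ACGTAC", "ACCTAT")
```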


For better understanding, we have classified the whole network into five
clusters, where the distant U1 and U2 are rich in samples from the USA. Cluster B
is mainly a shared cluster of China and the USA, while A and C are diverse.


The center of cluster A is shared by samples from the USA, China and Taiwan,
while the Chinese source shares ancestry (two mutations each) with the Colombian
(MT256924) and Indian (MT050493) samples. The sample from
Taiwan (MN985325) provides the sole outgroup to cluster U1, which densely
contains sequences from Washington (WA), USA. Cluster B is heavily centered
on the USA and China and provides direct descendants to Vietnam, Israel, India,
Pakistan, Italy, Nepal, Australia, Sweden and Korea, sharing one to four
mutations. Interestingly, the Swedish sample connects through the Australian node
rather than a Chinese one. The second USA cluster, U2, is connected to cluster B
by a rather small cluster C that contains European and South American samples
from Spain, France and Peru. The French sample of cluster C provides an outgroup
to the U2 cluster, which contains sequences from different states of the USA.
Collectively, our global-scale CoV2 spreading dynamics indicate that countries
with multiple or different source entries are assisting viral evolution at a rapid
pace.


Figure 1. Global Scale Phylodynamics Mapping of the CoV2 Samples
All the representative entries are colored according to geographical location.
Smaller black nodes are arbitrary links, while the number of hatch marks on
individual branches indicates the number of mutations. The bold branch
denotes the ancestral bat connection.


Phylodynamics of the USA

Until April 13, 2020 there were more than 400 sequences from the USA. Here, we
analyzed 355 complete genome samples from the USA, reported from
seventeen different states and including 24 samples from the Diamond
Princess cruise ship, which had 3771 passengers on board, of whom more than 700
were confirmed cases of CoV2 [19]. Since the cruise ship was carrying
CoV2-positive patients from Hong Kong, we used the Bat-CoV genome as an outgroup.
Interestingly, the cruise samples grouped next to the ancestor; we call this group
the Cruise-cluster. Along with the Cruise-cluster, one sample each from Oregon (OR,
MT304487) and Texas (TX, MT276331) stayed close to the ancestor (Figure 2).


Figure 2. Phylodynamics Updates of the CoV2 Population Reported from the
USA
The data represent 17 different states and include samples
collected from the Diamond Princess cruise ship and reported from the USA. The
states are Washington, California, Minnesota, Arizona, Rhode Island,
Massachusetts, Texas, Michigan, North Carolina, Wisconsin, Virginia, New York,
Georgia, Oregon, Florida, Illinois and New Hampshire, represented by the
abbreviations WA, CA, MN, AZ, RI, MA, TX, MI, NC, WI, VA, NY, GA, OR, FL, IL
and NH and by different colors.


The OR sample provides a base for one sample each from California (CA) and
Georgia (GA) and five from Washington (WA). The central base of the
Cruise-cluster is shared with the Arizona sample directly infected from China
(discussed above). Overall, the Cruise-cluster shares similarity with the majority
of the samples from CA and bifurcates further. The left-side group of WA samples
is the group previously referred to as U1 and is connected to the Cruise-cluster
by an arbitrary ancestor, suggesting that the cruise samples are not the direct
source of U1. Ultimately, the only valid source left is Taiwan. A similar case can
be observed in the right cluster, where the Cruise-cluster does not provide an
actual ancestral link.


Department of Knowledge and Research Support Services
Volume 1 Issue 1, 2021


Evolutionary Frequency of Initially Sequenced...


Discussion 


Phylodynamics has previously been used to combine immunodynamics, epidemiology,
and evolutionary biology to understand how infectious diseases are transmitted
and evolve [20]. A variety of evolutionary models assume a tree to facilitate the
testing and discussion of hypotheses. However, complex evolutionary scenarios
such as an increase in population size are poorly described by such models [21].
Such limitations have led to the development of a number of different types of
phylogenetic networks. To estimate the evolutionary frequency of the available
human CoV2 genomes and map them onto geographical locations, this study
presents the analysis through a median-joining network.
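
To give a feel for how a distance-based network links genomes, the sketch below builds a minimum spanning tree over pairwise mutation counts using Prim's algorithm. This is a deliberately simplified stand-in and not the median-joining method applied in this study: median-joining starts from minimum spanning networks and additionally infers unsampled intermediate nodes, which this sketch does not attempt. All names and sequences here are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(seqs: dict[str, str]) -> list[tuple[str, str, int]]:
    """Prim's algorithm over pairwise Hamming distances.

    Returns a list of edges (node_in_tree, new_node, mutations). Ties are
    broken alphabetically by node name so the result is deterministic.
    """
    names = list(seqs)
    if not names:
        return []
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Pick the cheapest edge that joins a new node to the current tree.
        dist, u, v = min(
            (hamming(seqs[u], seqs[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        edges.append((u, v, dist))
        in_tree.add(v)
    return edges


# Toy usage with made-up aligned sequences (hypothetical data).
toy = {"root": "AAAA", "s1": "AAAT", "s2": "AATT", "s3": "CAAA"}
toy_edges = minimum_spanning_tree(toy)
```

Each edge weight plays the role of the hatch marks drawn on a branch of the network, that is, the number of mutations separating the two connected genomes.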


Analyzing the global-scale evolution and spread of human CoV2, we
noticed the presence of Chinese samples only in clusters A and B, highlighting the
efficacy of the tight quarantine practices applied to Chinese citizens, which
proved efficient in retaining a homogeneous viral genomic pool. On the other hand,
samples from the USA were split into distinct clusters, indicating multiple ports
of entry of the virus and implying a delay in quarantine measures. Although the USA
had restrictions in place on all traffic coming from China, such measures
were not applied to traffic coming from the rest of the world; hence the virus was
not contained as efficiently as it was in China. A similar phenomenon
was observed in the Spanish samples, which are located in three different clusters
(A, B and C) and share ancestors from Taiwan, China, the USA and Israel
separately. In contrast, genomes reported from the USA population indicate that
the passengers from the Diamond Princess cruise ship were efficiently quarantined
and treated and are not the major source of the spread of infection in the USA.
The clustering of the cruise samples near the ancestral node is explained by two
main reasons. Firstly, the passengers were carrying the virus from the epicenter,
China, and secondly, they remained isolated inside the cruise ship, which
restricted viral evolution. Specifically, sequences from WA and VA showed
diversity and are scattered across almost every cluster. Overall, our data
emphasize that the CoV2 spread is higher in the USA
due to heterogeneity in the viral pool when compared with the rest of the affected
countries. In addition, the US government needs to take strict measures to keep
the viral spread limited to the source by restricting the free movement of
citizens.
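
The reasoning above rests on how many distinct clusters the samples of each country fall into. A small Python sketch of that tally is shown below. The (country, cluster) pairs are illustrative placeholders that only mirror the cluster memberships described in the text; they are not the actual per-sample assignments from the study, and the function name is an assumption.

```python
from collections import defaultdict


def clusters_per_country(assignments):
    """Map each country to the sorted list of clusters its samples occupy."""
    spread = defaultdict(set)
    for country, cluster in assignments:
        spread[country].add(cluster)
    return {country: sorted(found) for country, found in spread.items()}


# Illustrative placeholder data following the description in the text.
example_assignments = [
    ("China", "A"), ("China", "B"), ("China", "B"),
    ("Spain", "A"), ("Spain", "B"), ("Spain", "C"),
    ("USA", "A"), ("USA", "B"), ("USA", "U1"), ("USA", "U2"),
]
spread = clusters_per_country(example_assignments)
```

A country whose samples occupy few clusters, as described for China, is read in this study as having a more homogeneous viral pool, whereas a wider spread, as described for the USA and Spain, is read as evidence of multiple introductions.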


Conclusions 

This study design mainly emphasizes two factors: quarantine
measures and viral spread across borders. To present a fair picture, the whole
genome sequences used in this study were selected from the initial three months
only. Based on the resultant clusters, our findings suggest the importance of
quarantine measures. Samples from countries that took timely measures
stayed converged compared with others. This study used samples from the
Diamond Princess cruise ship as a control against sample sequences from
different states of the USA, which further strengthens the notion that stringent
and timely quarantine measures could have prevented the calamitous spread of the
virus and saved several lives lost to this pandemic. Overall, this study design
will be helpful for policy makers and scientists in countering other pandemics in
the future in the developed world. It can also be considered a beacon of hope for
underdeveloped countries fighting this pandemic, where health facilities are
limited and a vaccine is currently unavailable.